1 Introduction

1.1 Input files:

  • (AACT) aact_studies.tsv
  • (AACT) aact_drugs.tsv
  • (LeadMine) aact_drugs_leadmine.tsv
  • (PubChem) aact_drugs_smi_pubchem_cid.tsv
  • (PubChem) aact_drugs_smi_pubchem_cid2inchi.tsv
  • (ChEMBL) aact_drugs_inchi2chembl.tsv
  • (ChEMBL) aact_drugs_chembl_activity_pchembl.tsv
  • (ChEMBL) aact_drugs_chembl_target_component.tsv
  • (TCRD/Pharos) pharos_targets.tsv

nct_id is the study ID.

## [1] "Thu Mar 28 17:07:27 2019"
library(readr)
library(data.table)
library(plotly, quietly=TRUE)

2 Input studies and drugs

2.1 Studies

Read file of all studies in AACT.

## [1] "Total studies: 300214 ; unique NCT_IDs: 300214"
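
As a minimal sketch of how the count above might be produced: the toy table below stands in for aact_studies.tsv, and everything except the nct_id and study_type column names is an assumption made for illustration.

```r
library(data.table)

# Toy stand-in for aact_studies.tsv (values are illustrative, not real AACT rows).
# The real file could be loaded with fread("aact_studies.tsv", sep = "\t").
studies <- data.table(
  nct_id = c("NCT00000001", "NCT00000002", "NCT00000003"),
  study_type = c("Interventional", "Observational", "Interventional")
)

# Total rows versus distinct study IDs; equal values mean one row per study.
n_total <- nrow(studies)
n_unique <- uniqueN(studies$nct_id)
msg <- sprintf("Total studies: %d ; unique NCT_IDs: %d", n_total, n_unique)
print(msg)
```

Comparing the two numbers is a cheap check that nct_id can serve as a key in later joins.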

2.2 Drugs

Read file of all drugs in AACT.

  • id is the AACT intervention ID.
  • Note that one study may involve multiple drugs.
  • At this point a “drug” is identified by a name.
## [1] "Unique drug names: 91347 ; unique intervention IDs: 255077"

2.3 Studies: Interventional drug studies only

Select only Interventional studies (study_type) associated with drugs (via nct_id).

## [1] "Interventional studies: 237892 (79.2%)"
## [1] "Interventional drug studies: 124421 ; unique NCT_IDs: 124421"
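
The selection described above could be sketched as follows. The tables are toy data, and the drug column name is an assumption; only nct_id, id and study_type come from the text.

```r
library(data.table)

# Toy studies and drugs tables (illustrative values only).
studies <- data.table(
  nct_id = c("NCT01", "NCT02", "NCT03", "NCT04"),
  study_type = c("Interventional", "Observational",
                 "Interventional", "Interventional")
)
drugs <- data.table(
  id = 1:4,
  nct_id = c("NCT01", "NCT01", "NCT02", "NCT03"),
  name = c("aspirin", "placebo", "metformin", "aspirin")  # assumed column name
)

# Step 1: keep Interventional studies only.
interventional <- studies[study_type == "Interventional"]
pct <- 100 * nrow(interventional) / nrow(studies)

# Step 2: keep those with at least one drug record, linked via nct_id.
drug_studies <- interventional[nct_id %in% drugs$nct_id]

print(sprintf("Interventional studies: %d (%.1f%%)", nrow(interventional), pct))
print(sprintf("Interventional drug studies: %d", nrow(drug_studies)))
```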
Drug studies and drugs, by phase

phase             N_studies    N_drugs
Early Phase 1          1574       2615
Phase 1               23603      48593
Phase 1/Phase 2        6663      13288
Phase 2               33910      68850
Phase 2/Phase 3        3305       6503
Phase 3               22988      49507
Phase 4               19593      36331
NA                    12785      29390

Drugs (itv_ids), by study overall_status

overall_status                N
Completed                145006
Recruiting                33973
Terminated                19618
Unknown status            18463
Active, not recruiting    13962
Not yet recruiting         8001
NA                         7080
Withdrawn                  6969
Enrolling by invitation    1060
Suspended                   945
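
Tallies like the two tables above can be computed with grouped counts. This is a sketch on toy data, assuming the drug records have already been joined to their study's phase and overall_status; that table layout is an assumption.

```r
library(data.table)

# Toy drug records already annotated with study phase and status.
drugs <- data.table(
  id = 1:5,
  nct_id = c("A", "A", "B", "C", "C"),
  phase = c("Phase 1", "Phase 1", "Phase 2", "Phase 2", "Phase 2"),
  overall_status = c("Completed", "Completed", "Recruiting",
                     "Completed", "Completed")
)

# Distinct studies and distinct intervention IDs per phase.
by_phase <- drugs[, .(N_studies = uniqueN(nct_id), N_drugs = uniqueN(id)),
                  by = phase][order(phase)]

# Distinct intervention IDs per study status, largest group first.
by_status <- drugs[, .(N = uniqueN(id)), by = overall_status][order(-N)]

print(by_phase)
print(by_status)
```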

2.4 Drugs by study start_year

(To do: stack with study start_year.)

## Warning: Ignoring 1 observations

## Warning: Ignoring 1 observations

2.5 Drug-trials by Phase and Status

3 NextMove LeadMine NER

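AACT drug names are resolved by LeadMine named entity recognition (NER) to standard names and chemical structures, represented as SMILES. Counting by structure rather than by name gives cheminformatically rigorous counts of drugs as active pharmaceutical ingredients (APIs).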

## [1] "Drug unique SMILES resolved by LeadMine: 4699 ; unique intervention IDs: 171741"
## [1] "Drugs (drug names) with resolved structure: 180555 / 197300 (91.5%)"

3.1 NER mentions by intervention ID.

## [1] "Mentions by intervention ID: 157862 / 171741 (91.9%)"

3.2 NER mentions by trial (NCT ID).

## [1] "Mentions by study: 92966 / 99647 (93.3%)"

3.3 NER mentions by drug, i.e. name in AACT.

## [1] "Mentions by drug name: 11108 / 58297 (19.1%)"
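
The three coverage figures in this section share one pattern: distinct resolved values over distinct values overall. A sketch on toy data follows; the leadmine table layout and the coverage() helper are hypothetical.

```r
library(data.table)

# Toy drug records and a hypothetical LeadMine result (one row per resolved ID).
drugs <- data.table(
  id = 1:4,
  nct_id = c("A", "A", "B", "C"),
  name = c("aspirin", "placebo", "aspirin 81mg", "saline")
)
leadmine <- data.table(
  id = c(1L, 3L),
  smiles = c("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Oc1ccccc1C(=O)O")
)

# Hypothetical helper: distinct hits over distinct totals, as a formatted string.
coverage <- function(hit, total) {
  n_hit <- uniqueN(hit)
  n_total <- uniqueN(total)
  sprintf("%d / %d (%.1f%%)", n_hit, n_total, 100 * n_hit / n_total)
}

resolved <- drugs[id %in% leadmine$id]
by_id <- coverage(resolved$id, drugs$id)
by_study <- coverage(resolved$nct_id, drugs$nct_id)
by_name <- coverage(resolved$name, drugs$name)

print(c(by_id, by_study, by_name))
```

In the toy data two distinct names share one SMILES, which illustrates how the count of unique structures can be much smaller than the count of resolved names.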

4 PubChem

4.1 Intervention IDs to CIDs from PubChem (via SMILES)

## [1] "PubChem SMILES2CID hits: 3960 / 4698 (84.3%)"
## [1] "Intervention IDs mapped to PubChem CIDs (via SMILES): 153876"

4.2 InChIKeys from PubChem (via CIDs)

## [1] "PubChem CIDs with InChIKeys: 3801"

5 ChEMBL

5.1 ChEMBL molecule IDs, and properties (via InChIKeys)

## [1] "ChEMBL compounds mapped via InChIKeys: 3332"
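
The mapping chain from SMILES to PubChem CID to InChIKey to ChEMBL ID amounts to a sequence of inner joins, each of which can lose rows. A sketch with placeholder identifiers follows; all values and column names are assumptions, not real PubChem or ChEMBL records.

```r
library(data.table)

# Placeholder mapping tables (not real identifiers).
smi2cid <- data.table(smiles = c("smi_a", "smi_b", "smi_c"),
                      cid = c(101L, 102L, NA))
cid2inchi <- data.table(cid = c(101L, 102L),
                        inchikey = c("KEY-A", "KEY-B"))
inchi2chembl <- data.table(inchikey = "KEY-A", chembl_id = "CHEMBL_A")

# SMILES with a CID hit.
hits <- smi2cid[!is.na(cid)]

# Inner joins: only rows present on both sides survive each step.
with_inchi <- merge(hits, cid2inchi, by = "cid")
with_chembl <- merge(with_inchi, inchi2chembl, by = "inchikey")

print(sprintf("SMILES2CID hits: %d / %d", nrow(hits), nrow(smi2cid)))
print(sprintf("CIDs with InChIKeys: %d", uniqueN(with_inchi$cid)))
print(sprintf("ChEMBL compounds mapped: %d", uniqueN(with_chembl$chembl_id)))
```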

5.2 ChEMBL activities for mapped compounds

Select only activities that have a pChEMBL value, as a confidence filter.

## [1] "ChEMBL activities: 124438"
## [1] "ChEMBL activities molecules: 2287 ; targets: 3832 ; documents: 16198"
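
The pChEMBL filter and the summary counts could look like the sketch below. The column names follow ChEMBL's usual field naming but are an assumption about this file, and the rows are toy data.

```r
library(data.table)

# Toy activity records; NA marks an activity without a pChEMBL value.
act <- data.table(
  molecule_chembl_id = c("M1", "M1", "M2", "M3"),
  target_chembl_id = c("T1", "T2", "T1", "T3"),
  document_chembl_id = c("D1", "D1", "D2", "D3"),
  pchembl_value = c(6.5, NA, 7.2, NA)
)

# Keep only activities with a pChEMBL value.
act_p <- act[!is.na(pchembl_value)]

n_mol <- uniqueN(act_p$molecule_chembl_id)
n_tgt <- uniqueN(act_p$target_chembl_id)
n_doc <- uniqueN(act_p$document_chembl_id)

print(sprintf("ChEMBL activities: %d", nrow(act_p)))
print(sprintf("molecules: %d ; targets: %d ; documents: %d", n_mol, n_tgt, n_doc))
```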

5.3 ChEMBL targets (via activities)

## [1] "ChEMBL target proteins: 3157"

6 IDG/TCRD

## [1] "ChEMBL target proteins mapped to TCRD (human): 1806"

6.1 Targets by organism (top 10):

## [1] "Organisms: 187"
Targets by organism (top 10)

organism                    N_targets
Homo sapiens                     1806
Rattus norvegicus                 529
Mus musculus                      238
Bos taurus                         98
Sus scrofa                         36
Cavia porcellus                    26
Escherichia coli K-12              19
Oryctolagus cuniculus              18
Escherichia coli                   17
Mycobacterium tuberculosis         17

6.2 Human single-protein targets only.

## [1] "Human targets: 1806"

target_type                     N
SINGLE PROTEIN               1216
PROTEIN COMPLEX               247
PROTEIN FAMILY                210
PROTEIN COMPLEX GROUP          91
PROTEIN-PROTEIN INTERACTION    16
SELECTIVITY GROUP              14
CHIMERIC PROTEIN               12

## [1] "Human single-protein targets: 1216 ; unique UniProts: 1216"
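
A sketch of the human, single-protein selection on toy data. The organism and target_type columns appear in the tables above; the uniprot column name and all identifiers are assumptions.

```r
library(data.table)

# Toy target table (placeholder identifiers).
tgt <- data.table(
  target_chembl_id = c("T1", "T2", "T3", "T4"),
  organism = c("Homo sapiens", "Homo sapiens",
               "Rattus norvegicus", "Homo sapiens"),
  target_type = c("SINGLE PROTEIN", "PROTEIN COMPLEX",
                  "SINGLE PROTEIN", "SINGLE PROTEIN"),
  uniprot = c("U1", "U2", "U3", "U4")
)

# Restrict to human targets, then tabulate by target type.
human <- tgt[organism == "Homo sapiens"]
by_type <- human[, .N, by = target_type][order(-N)]

# Keep single-protein targets, which map to a single UniProt accession.
single <- human[target_type == "SINGLE PROTEIN"]

print(by_type)
print(sprintf("Human single-protein targets: %d ; unique UniProts: %d",
              nrow(single), uniqueN(single$uniprot)))
```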

6.3 Targets by IDG Target Development Level (TDL):

TDL      N
Tchem  733
Tclin  341
Tbio   140
Tdark    2
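
The TDL breakdown can be sketched as a join of the single-protein targets to TCRD followed by a grouped count. The tables, the uniprot join key and the tdl column name are assumptions made for illustration.

```r
library(data.table)

# Toy single-protein targets and a toy TCRD/Pharos extract.
single <- data.table(uniprot = c("U1", "U2", "U3", "U4"))
tcrd <- data.table(
  uniprot = c("U1", "U2", "U3", "U4", "U5"),
  tdl = c("Tclin", "Tchem", "Tchem", "Tbio", "Tdark")
)

# Inner join on UniProt accession, then count targets per TDL.
tdl_counts <- merge(single, tcrd, by = "uniprot")[, .N, by = tdl][order(-N)]

print(tdl_counts)
```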